In the present world, most people rely on credit and loans provided by banks and other financial institutions. One major sector involving a large volume of loans is the automobile sector. Vehicle loan defaults are currently on the higher side, causing substantial losses to financial institutions. As a result, loan underwriting has become more stringent, and rejections due to poor customer profiles have increased. Consequently, the demand for an effective and efficient credit risk scoring model has grown among financial organizations.
The objective of this project is to accurately model and predict the risk of a borrower defaulting on a vehicle loan on the due date of the first EMI (Equated Monthly Instalment). The project is divided into two phases, and this report covers the Phase 2 objective.
The original dataset is from the 'DataScience FinHack' competition presented by L&T Financial Services and Analytics Vidhya, and we obtained it from https://www.kaggle.com/lampubhutia/loandefault-ltfs-avml-finhack . The website provides two datasets, 'train.csv' and 'test.csv'. We downloaded train.csv and renamed it loandefault.csv. Only train.csv is considered as the dataset in this project phase; for data splitting, we split loandefault.csv in Phase 2 as required. The dataset consists of 41 attributes in total, of which 40 are descriptive features and one is the target feature.
| Variable | Description | Data_Type | Unit |
|---|---|---|---|
| UniqueID | Unique ID for identifying customers | Nominal category | Not Applicable |
| disbursed_amount | Loan amount disbursed | Numerical | Indian Rupees |
| asset_cost | The cost of the asset | Numerical | Indian Rupees |
| ltv | Loan-to-value of the asset | Numerical | Percentage |
| branch_id | The branch where the loan was disbursed | Nominal category | Not Applicable |
| supplier_id | Vehicle dealer ID where the loan was disbursed | Nominal category | Not Applicable |
| manufacturer_id | Vehicle manufacturer (Hero, Honda, TVS etc.) | Nominal category | Not Applicable |
| Current_pincode | Current pincode of the customer | Nominal category | Not Applicable |
| Date.of.Birth | Date of birth of the customer | Numerical | DD/MM/YYYY |
| Employment.Type | Employment type of the customer (Salaried/Self employed) | Nominal category | Not Applicable |
| DisbursalDate | Date of disbursement | Numerical | DD/MM/YYYY |
| State_ID | State of disbursement | Nominal category | Not Applicable |
| Employee_code_ID | Employee of the organization who logged the disbursement | Nominal category | Not Applicable |
| MobileNo_Avl_Flag | Flagged as 1 if the mobile number is shared by the customer | Nominal category - Binary | Not Applicable |
| Aadhar_flag | Flagged as 1 if the Aadhaar number is shared by the customer | Nominal category - Binary | Not Applicable |
| PAN_flag | Flagged as 1 if the PAN number is shared by the customer | Nominal category - Binary | Not Applicable |
| VoterID_flag | Flagged as 1 if the Voter ID is shared by the customer | Nominal category - Binary | Not Applicable |
| Driving_flag | Flagged as 1 if the driving license is shared by the customer | Nominal category - Binary | Not Applicable |
| Passport_flag | Flagged as 1 if the passport is shared by the customer | Nominal category - Binary | Not Applicable |
| PERFORM_CNS.SCORE | Bureau score | Numerical | Not Applicable |
| PERFORM_CNS.SCORE.DESCRIPTION | Bureau score description | Ordinal category | Not Applicable |
| PRI.NO.OF.ACCTS | Total number of loans taken by the customer at the time of disbursement | Numerical | Not Applicable |
| PRI.ACTIVE.ACCTS | Total number of active loans at the time of disbursement | Numerical | Not Applicable |
| PRI.OVERDUE.ACCTS | Total number of default accounts at the time of disbursement | Numerical | Not Applicable |
| PRI.CURRENT.BALANCE | Total principal outstanding amount of the active loans at the time of disbursement | Numerical | Indian Rupees |
| PRI.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the loans at the time of disbursement | Numerical | Indian Rupees |
| PRI.DISBURSED.AMOUNT | Total amount that was disbursed for all the loans at the time of disbursement | Numerical | Indian Rupees |
| SEC.NO.OF.ACCTS | Total number of secondary loans taken by the customer at the time of disbursement | Numerical | Not Applicable |
| SEC.ACTIVE.ACCTS | Total number of active secondary loans at the time of disbursement | Numerical | Not Applicable |
| SEC.OVERDUE.ACCTS | Total number of default secondary accounts at the time of disbursement | Numerical | Not Applicable |
| SEC.CURRENT.BALANCE | Total principal outstanding amount of the active secondary loans at the time of disbursement | Numerical | Indian Rupees |
| SEC.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the secondary loans at the time of disbursement | Numerical | Indian Rupees |
| SEC.DISBURSED.AMOUNT | Total amount that was disbursed for all the secondary loans at the time of disbursement | Numerical | Indian Rupees |
| PRIMARY.INSTAL.AMT | EMI amount of the primary loan | Numerical | Indian Rupees |
| SEC.INSTAL.AMT | EMI amount of the secondary loan | Numerical | Indian Rupees |
| NEW.ACCTS.IN.LAST.SIX.MONTHS | New loans taken by the customer in the last 6 months before disbursement | Numerical | Not Applicable |
| DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | Loans defaulted in the last 6 months | Numerical | Not Applicable |
| AVERAGE.ACCT.AGE | Average loan tenure | Numerical | Years and months (e.g. 1yrs 11mon) |
| CREDIT.HISTORY.LENGTH | Time since first loan | Numerical | Years and months (e.g. 1yrs 11mon) |
| NO.OF_INQUIRIES | Enquiries made by the customer for loans | Numerical | Not Applicable |
| loan_default | Whether the payment defaulted on the first EMI (0: no default, 1: default) | Nominal category - Binary | Not Applicable |
The target feature is:
loan_default: payment default on the first EMI. The target feature has two levels, 0 and 1: 0 denotes no default on the first EMI payment, and 1 denotes default on the first EMI payment.
The overall approach towards building an accurate predictive model for this report is as follows:
Feature Selection
Feature selection is performed to reduce the number of input variables while developing the models.
The 15 most important features will be selected, and these features will be used for training all the considered models.
Sampling & Train Test Data Splitting
A random sample of 100,000 records is selected from the dataset for further splitting.
To evaluate precision and error rate, the dataset is divided into training (70%) and test (30%) sets.
Model fitting and Hyperparameter Tuning
The following 5 classification models will be developed and tested using the loan default dataset to predict first-EMI default by the borrower:
K- Nearest Neighbor
The k-nearest neighbor (KNN) algorithm is one of the most widely used machine learning algorithms. The k-nearest neighbors model predicts the target level by majority vote among the set of k nearest neighbors of the query instance. To model the KNN algorithm for prediction, various values of p (p=1 for Manhattan distance, p=2 for Euclidean distance) and n_neighbors are considered. GridSearchCV is performed to choose the best parameters for the final model.
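To make the role of the p parameter concrete, the short sketch below (with made-up points, for illustration only) compares the two Minkowski distances considered in the grid search:

```python
import numpy as np

# Two illustrative 2-D points (hypothetical values, for demonstration only)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Minkowski distance with p=1 is Manhattan distance, with p=2 it is Euclidean distance
manhattan = np.sum(np.abs(a - b))          # |0-3| + |0-4| = 7
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5

print(manhattan, euclidean)
```

The two metrics can rank neighbors differently, which is why both values of p are included in the grid.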
Decision Tree
A decision tree classifies instances by sorting them down the tree from the root to a leaf node. The root node is the top-most node of the decision tree and represents the entire sample of the dataset. A parent node is a node that divides into sub-nodes, and these sub-nodes are known as child nodes. A node that does not have any child nodes is called a leaf (or terminal) node.
To model the decision tree algorithm for prediction in this report, various values of the criterion, maximum depth, and minimum samples per split are considered. GridSearchCV is performed to choose the best parameters for the final model.
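As a minimal illustration of the criterion hyperparameter, the Gini impurity used by default in scikit-learn decision trees can be computed by hand (the label sets below are toy values, for demonstration only):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # a pure node has impurity 0.0
print(gini([0, 1, 0, 1]))  # a 50/50 node has impurity 0.5
```

The tree chooses splits that reduce this impurity the most; the entropy criterion plays the same role with a different formula.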
Naive Bayes
Bayesian classification is a supervised machine learning method, named after Thomas Bayes, who proposed Bayes' theorem. Naive Bayes is often suggested when the dataset has a large number of instances and attributes.
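A minimal sketch of the idea, using scikit-learn's GaussianNB on a tiny made-up dataset (the values are illustrative assumptions, not taken from the loan data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# One numeric feature; class 0 clusters near 1.0 and class 1 near 5.0
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB fits a per-class Gaussian to each feature and applies Bayes' theorem
nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1], [5.1]]))  # points near each cluster get that cluster's class
```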
Bagging
A bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions to form a final prediction.
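A minimal sketch of a bagging classifier on synthetic data (the dataset here is generated purely for illustration; the report's models are fit on the loan data later):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=999)

# Each of the 10 trees is fit on a bootstrap sample of the rows;
# the final prediction is the majority vote of the individual trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=999)
bag.fit(X, y)
print(bag.score(X, y))
```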
Random Forest
Random forest is based on decision trees: rather than building a single tree, it aggregates the output of a number of shallow trees. Experiments with various values of the criterion (Gini index or entropy) and n_estimators will be conducted with the help of GridSearchCV to identify the best parameters and build the best random forest model.
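The planned tuning can be sketched as follows; the grid values and synthetic data here are illustrative assumptions, not the settings used later in the report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the loan dataset, for illustration only
X, y = make_classification(n_samples=300, n_features=5, random_state=999)

# Illustrative grid over the split criterion and the number of trees
params_rf = {'criterion': ['gini', 'entropy'], 'n_estimators': [10, 50]}
gs_rf = GridSearchCV(RandomForestClassifier(random_state=999),
                     param_grid=params_rf, cv=3, scoring='roc_auc')
gs_rf.fit(X, y)
print(gs_rf.best_params_)
```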
#Importing Modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as mpt
import seaborn as sb
np.random.seed(999)
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
import sklearn.metrics as metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn import feature_selection as fs
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from numpy import argsort
from sklearn.model_selection import GridSearchCV
# Set a seed value
seed_value = 999
# 1. Initialise `PYTHONHASHSEED` environment variable
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Initialise Python's own pseudo-random generator
import random
random.seed(seed_value)
# 3. Initialise Numpy's pseudo-random generator
import numpy as np
np.random.seed(seed_value)
ldd=pd.read_csv('Data_Group45.csv',sep=',')
pd.set_option('display.max_columns', None)
ldd.head()
| UniqueID | disbursed_amount | asset_cost | ltv | branch_id | supplier_id | manufacturer_id | Current_pincode_ID | Date.of.Birth | Employment.Type | DisbursalDate | State_ID | Employee_code_ID | MobileNo_Avl_Flag | Aadhar_flag | PAN_flag | VoterID_flag | Driving_flag | Passport_flag | PERFORM_CNS.SCORE | PERFORM_CNS.SCORE.DESCRIPTION | PRI.NO.OF.ACCTS | PRI.ACTIVE.ACCTS | PRI.OVERDUE.ACCTS | PRI.CURRENT.BALANCE | PRI.SANCTIONED.AMOUNT | PRI.DISBURSED.AMOUNT | SEC.NO.OF.ACCTS | SEC.ACTIVE.ACCTS | SEC.OVERDUE.ACCTS | SEC.CURRENT.BALANCE | SEC.SANCTIONED.AMOUNT | SEC.DISBURSED.AMOUNT | PRIMARY.INSTAL.AMT | SEC.INSTAL.AMT | NEW.ACCTS.IN.LAST.SIX.MONTHS | DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | AVERAGE.ACCT.AGE | CREDIT.HISTORY.LENGTH | NO.OF_INQUIRIES | loan_default | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 420825 | 50578 | 58400 | 89.55 | 67 | 22807 | 45 | 1441 | 01/01/1984 | Salaried | 03/08/2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 1 | 537409 | 47145 | 65550 | 73.23 | 67 | 22807 | 45 | 1502 | 31/07/1985 | Self employed | 26/09/2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 598 | I-Medium Risk | 1 | 1 | 1 | 27600 | 50200 | 50200 | 0 | 0 | 0 | 0 | 0 | 0 | 1991 | 0 | 0 | 1 | 1yrs 11mon | 1yrs 11mon | 0 | 1 |
| 2 | 417566 | 53278 | 61360 | 89.63 | 67 | 22807 | 45 | 1497 | 24/08/1985 | Self employed | 01/08/2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 3 | 624493 | 57513 | 66113 | 88.48 | 67 | 22807 | 45 | 1501 | 30/12/1993 | Self employed | 26/10/2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 305 | L-Very High Risk | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 0yrs 8mon | 1yrs 3mon | 1 | 1 |
| 4 | 539055 | 52378 | 60300 | 88.39 | 67 | 22807 | 45 | 1495 | 09/12/1977 | Self employed | 26/09/2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 1 | 1 |
ldd.columns
Index(['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id',
'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Date.of.Birth',
'Employment.Type', 'DisbursalDate', 'State_ID', 'Employee_code_ID',
'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag',
'Driving_flag', 'Passport_flag', 'PERFORM_CNS.SCORE',
'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS',
'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT',
'PRI.DISBURSED.AMOUNT', 'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS',
'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT',
'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT',
'NEW.ACCTS.IN.LAST.SIX.MONTHS', 'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS',
'AVERAGE.ACCT.AGE', 'CREDIT.HISTORY.LENGTH', 'NO.OF_INQUIRIES',
'loan_default'],
dtype='object')
ldd.drop(['UniqueID','branch_id','manufacturer_id','supplier_id','State_ID','Current_pincode_ID','Employee_code_ID'],axis=1,inplace=True)
print('Missing values')
ldd.isnull().sum()
Missing values
disbursed_amount                          0
asset_cost                                0
ltv                                       0
Date.of.Birth                             0
Employment.Type                        7661
DisbursalDate                             0
MobileNo_Avl_Flag                         0
Aadhar_flag                               0
PAN_flag                                  0
VoterID_flag                              0
Driving_flag                              0
Passport_flag                             0
PERFORM_CNS.SCORE                         0
PERFORM_CNS.SCORE.DESCRIPTION             0
PRI.NO.OF.ACCTS                           0
PRI.ACTIVE.ACCTS                          0
PRI.OVERDUE.ACCTS                         0
PRI.CURRENT.BALANCE                       0
PRI.SANCTIONED.AMOUNT                     0
PRI.DISBURSED.AMOUNT                      0
SEC.NO.OF.ACCTS                           0
SEC.ACTIVE.ACCTS                          0
SEC.OVERDUE.ACCTS                         0
SEC.CURRENT.BALANCE                       0
SEC.SANCTIONED.AMOUNT                     0
SEC.DISBURSED.AMOUNT                      0
PRIMARY.INSTAL.AMT                        0
SEC.INSTAL.AMT                            0
NEW.ACCTS.IN.LAST.SIX.MONTHS              0
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS       0
AVERAGE.ACCT.AGE                          0
CREDIT.HISTORY.LENGTH                     0
NO.OF_INQUIRIES                           0
loan_default                              0
dtype: int64
ldd.dropna(inplace=True)
print('Missing values')
ldd.isnull().sum()
Missing values
disbursed_amount                          0
asset_cost                                0
ltv                                       0
Date.of.Birth                             0
Employment.Type                           0
DisbursalDate                             0
MobileNo_Avl_Flag                         0
Aadhar_flag                               0
PAN_flag                                  0
VoterID_flag                              0
Driving_flag                              0
Passport_flag                             0
PERFORM_CNS.SCORE                         0
PERFORM_CNS.SCORE.DESCRIPTION             0
PRI.NO.OF.ACCTS                           0
PRI.ACTIVE.ACCTS                          0
PRI.OVERDUE.ACCTS                         0
PRI.CURRENT.BALANCE                       0
PRI.SANCTIONED.AMOUNT                     0
PRI.DISBURSED.AMOUNT                      0
SEC.NO.OF.ACCTS                           0
SEC.ACTIVE.ACCTS                          0
SEC.OVERDUE.ACCTS                         0
SEC.CURRENT.BALANCE                       0
SEC.SANCTIONED.AMOUNT                     0
SEC.DISBURSED.AMOUNT                      0
PRIMARY.INSTAL.AMT                        0
SEC.INSTAL.AMT                            0
NEW.ACCTS.IN.LAST.SIX.MONTHS              0
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS       0
AVERAGE.ACCT.AGE                          0
CREDIT.HISTORY.LENGTH                     0
NO.OF_INQUIRIES                           0
loan_default                              0
dtype: int64
ldd=ldd.rename(columns={"ltv":"Loan to Value of the asset",
"PERFORM_CNS.SCORE":"Bureau Score",
"PERFORM_CNS.SCORE.DESCRIPTION":"Bureau score description",
"PRIMARY.INSTAL.AMT":"Primary_EMI",
"SEC.INSTAL.AMT":"Secondary_EMI",
"DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS":"Default_Acct_in_6_months",
"AVERAGE.ACCT.AGE":"Average_Tenure","Date.of.Birth":"DOB"})
ldd.head()
| disbursed_amount | asset_cost | Loan to Value of the asset | DOB | Employment.Type | DisbursalDate | MobileNo_Avl_Flag | Aadhar_flag | PAN_flag | VoterID_flag | Driving_flag | Passport_flag | Bureau Score | Bureau score description | PRI.NO.OF.ACCTS | PRI.ACTIVE.ACCTS | PRI.OVERDUE.ACCTS | PRI.CURRENT.BALANCE | PRI.SANCTIONED.AMOUNT | PRI.DISBURSED.AMOUNT | SEC.NO.OF.ACCTS | SEC.ACTIVE.ACCTS | SEC.OVERDUE.ACCTS | SEC.CURRENT.BALANCE | SEC.SANCTIONED.AMOUNT | SEC.DISBURSED.AMOUNT | Primary_EMI | Secondary_EMI | NEW.ACCTS.IN.LAST.SIX.MONTHS | Default_Acct_in_6_months | Average_Tenure | CREDIT.HISTORY.LENGTH | NO.OF_INQUIRIES | loan_default | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50578 | 58400 | 89.55 | 01/01/1984 | Salaried | 03/08/2018 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 1 | 47145 | 65550 | 73.23 | 31/07/1985 | Self employed | 26/09/2018 | 1 | 1 | 0 | 0 | 0 | 0 | 598 | I-Medium Risk | 1 | 1 | 1 | 27600 | 50200 | 50200 | 0 | 0 | 0 | 0 | 0 | 0 | 1991 | 0 | 0 | 1 | 1yrs 11mon | 1yrs 11mon | 0 | 1 |
| 2 | 53278 | 61360 | 89.63 | 24/08/1985 | Self employed | 01/08/2018 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 3 | 57513 | 66113 | 88.48 | 30/12/1993 | Self employed | 26/10/2018 | 1 | 1 | 0 | 0 | 0 | 0 | 305 | L-Very High Risk | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 0yrs 8mon | 1yrs 3mon | 1 | 1 |
| 4 | 52378 | 60300 | 88.39 | 09/12/1977 | Self employed | 26/09/2018 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 1 | 1 |
# Parse DD/MM/YYYY date strings explicitly so day and month are not swapped
ldd['DOB']=pd.to_datetime(ldd['DOB'], dayfirst=True)
ldd['DisbursalDate']=pd.to_datetime(ldd['DisbursalDate'], dayfirst=True)
print(ldd.dtypes)
disbursed_amount                        int64
asset_cost                              int64
Loan to Value of the asset            float64
DOB                            datetime64[ns]
Employment.Type                        object
DisbursalDate                  datetime64[ns]
MobileNo_Avl_Flag                       int64
Aadhar_flag                             int64
PAN_flag                                int64
VoterID_flag                            int64
Driving_flag                            int64
Passport_flag                           int64
Bureau Score                            int64
Bureau score description               object
PRI.NO.OF.ACCTS                         int64
PRI.ACTIVE.ACCTS                        int64
PRI.OVERDUE.ACCTS                       int64
PRI.CURRENT.BALANCE                     int64
PRI.SANCTIONED.AMOUNT                   int64
PRI.DISBURSED.AMOUNT                    int64
SEC.NO.OF.ACCTS                         int64
SEC.ACTIVE.ACCTS                        int64
SEC.OVERDUE.ACCTS                       int64
SEC.CURRENT.BALANCE                     int64
SEC.SANCTIONED.AMOUNT                   int64
SEC.DISBURSED.AMOUNT                    int64
Primary_EMI                             int64
Secondary_EMI                           int64
NEW.ACCTS.IN.LAST.SIX.MONTHS            int64
Default_Acct_in_6_months                int64
Average_Tenure                         object
CREDIT.HISTORY.LENGTH                  object
NO.OF_INQUIRIES                         int64
loan_default                            int64
dtype: object
from datetime import date
def calculate_age(row):
disb=row.DisbursalDate
birthdate=row['DOB']
Age=disb.year-birthdate.year-((disb.month,disb.day)<(birthdate.month, birthdate.day))
return Age
Age=ldd.apply(calculate_age,axis=1)
print(Age)
ldd['Age'] = Age
0 34
1 33
2 32
3 24
4 41
..
233149 30
233150 30
233151 42
233152 24
233153 34
Length: 225493, dtype: int64
def avgtenure(a):
splitdata=a.split(" ")
year=int(splitdata[0].split("y")[0])
month=int(splitdata[1].split("m")[0])
tenure=(12*year)+month
return tenure
ldd['Average_Tenure']=ldd['Average_Tenure'].apply(avgtenure)
ldd['CREDIT.HISTORY.LENGTH']=ldd['CREDIT.HISTORY.LENGTH'].apply(avgtenure)
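The conversion can be sanity-checked on values seen in the data (the helper is repeated here so the snippet is self-contained):

```python
def avgtenure(a):
    """Convert a tenure string such as '1yrs 11mon' to a total number of months."""
    splitdata = a.split(" ")
    year = int(splitdata[0].split("y")[0])
    month = int(splitdata[1].split("m")[0])
    return (12 * year) + month

print(avgtenure("0yrs 0mon"))   # 0
print(avgtenure("1yrs 11mon"))  # 23
```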
ldd['Bureau score description']=ldd['Bureau score description'].replace(('No Bureau History Available','Not Scored: Sufficient History Not Available',
'Not Scored: Not Enough Info available on the customer','Not Scored: No Activity seen on the customer (Inactive)',
'Not Scored: No Updates available in last 36 months','Not Scored: Only a Guarantor',
'Not Scored: More than 50 active Accounts found'),0)
ldd['Bureau score description']=ldd['Bureau score description'].replace(('C-Very Low Risk','A-Very Low Risk','D-Very Low Risk','B-Very Low Risk'),1)
ldd['Bureau score description']=ldd['Bureau score description'].replace(('E-Low Risk','F-Low Risk','G-Low Risk'),2)
ldd['Bureau score description']=ldd['Bureau score description'].replace(('H-Medium Risk','I-Medium Risk'),3)
ldd['Bureau score description']=ldd['Bureau score description'].replace(('J-High Risk','K-High Risk'),4)
ldd['Bureau score description']=ldd['Bureau score description'].replace(('L-Very High Risk','M-Very High Risk'),5)
ldd['Bureau score description'].value_counts()
0    124253
1     49671
2     17906
3     12135
4     11774
5      9754
Name: Bureau score description, dtype: int64
ldd.drop(['DOB','DisbursalDate'],axis=1,inplace=True)
ldd.shape
(225493, 33)
ldd = pd.get_dummies(ldd)
ldd.dtypes
disbursed_amount                  int64
asset_cost                        int64
Loan to Value of the asset      float64
MobileNo_Avl_Flag                 int64
Aadhar_flag                       int64
PAN_flag                          int64
VoterID_flag                      int64
Driving_flag                      int64
Passport_flag                     int64
Bureau Score                      int64
Bureau score description          int64
PRI.NO.OF.ACCTS                   int64
PRI.ACTIVE.ACCTS                  int64
PRI.OVERDUE.ACCTS                 int64
PRI.CURRENT.BALANCE               int64
PRI.SANCTIONED.AMOUNT             int64
PRI.DISBURSED.AMOUNT              int64
SEC.NO.OF.ACCTS                   int64
SEC.ACTIVE.ACCTS                  int64
SEC.OVERDUE.ACCTS                 int64
SEC.CURRENT.BALANCE               int64
SEC.SANCTIONED.AMOUNT             int64
SEC.DISBURSED.AMOUNT              int64
Primary_EMI                       int64
Secondary_EMI                     int64
NEW.ACCTS.IN.LAST.SIX.MONTHS      int64
Default_Acct_in_6_months          int64
Average_Tenure                    int64
CREDIT.HISTORY.LENGTH             int64
NO.OF_INQUIRIES                   int64
loan_default                      int64
Age                               int64
Employment.Type_Salaried          uint8
Employment.Type_Self employed     uint8
dtype: object
ldd.shape
(225493, 34)
We perform a set of preliminary activities before modeling.
Many machine learning algorithms work better when the feature values are on a similar scale. Hence, the values of the features are scaled to the range 0 to 1, without changing the shape of each distribution, using the min-max scaler from the scikit-learn library.
Scaled Value $= \frac{X - X_{min}}{X_{max} - X_{min}}$
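As a worked example of the formula (a toy column with made-up values), min-max scaling maps the minimum to 0, the maximum to 1, and the midpoint to 0.5:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column: min = 10, max = 60, so 35 scales to (35 - 10) / (60 - 10) = 0.5
x = np.array([[10.0], [35.0], [60.0]])
scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())  # [0.  0.5 1. ]
```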
Data=ldd.drop(columns=['loan_default'],axis=1)
Target=ldd.loan_default
Data
| disbursed_amount | asset_cost | Loan to Value of the asset | MobileNo_Avl_Flag | Aadhar_flag | PAN_flag | VoterID_flag | Driving_flag | Passport_flag | Bureau Score | Bureau score description | PRI.NO.OF.ACCTS | PRI.ACTIVE.ACCTS | PRI.OVERDUE.ACCTS | PRI.CURRENT.BALANCE | PRI.SANCTIONED.AMOUNT | PRI.DISBURSED.AMOUNT | SEC.NO.OF.ACCTS | SEC.ACTIVE.ACCTS | SEC.OVERDUE.ACCTS | SEC.CURRENT.BALANCE | SEC.SANCTIONED.AMOUNT | SEC.DISBURSED.AMOUNT | Primary_EMI | Secondary_EMI | NEW.ACCTS.IN.LAST.SIX.MONTHS | Default_Acct_in_6_months | Average_Tenure | CREDIT.HISTORY.LENGTH | NO.OF_INQUIRIES | Age | Employment.Type_Salaried | Employment.Type_Self employed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50578 | 58400 | 89.55 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 1 | 0 |
| 1 | 47145 | 65550 | 73.23 | 1 | 1 | 0 | 0 | 0 | 0 | 598 | 3 | 1 | 1 | 1 | 27600 | 50200 | 50200 | 0 | 0 | 0 | 0 | 0 | 0 | 1991 | 0 | 0 | 1 | 23 | 23 | 0 | 33 | 0 | 1 |
| 2 | 53278 | 61360 | 89.63 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 32 | 0 | 1 |
| 3 | 57513 | 66113 | 88.48 | 1 | 1 | 0 | 0 | 0 | 0 | 305 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 8 | 15 | 1 | 24 | 0 | 1 |
| 4 | 52378 | 60300 | 88.39 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 41 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233149 | 63213 | 105405 | 60.72 | 1 | 0 | 0 | 1 | 0 | 0 | 735 | 1 | 4 | 3 | 0 | 390443 | 416133 | 416133 | 0 | 0 | 0 | 0 | 0 | 0 | 4084 | 0 | 0 | 0 | 21 | 39 | 0 | 30 | 1 | 0 |
| 233150 | 73651 | 100600 | 74.95 | 1 | 0 | 0 | 1 | 0 | 0 | 825 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1565 | 0 | 0 | 0 | 6 | 6 | 0 | 30 | 0 | 1 |
| 233151 | 33484 | 71212 | 48.45 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 0 |
| 233152 | 34259 | 73286 | 49.10 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 1 | 0 |
| 233153 | 75751 | 116009 | 66.81 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 1 | 0 |
225493 rows × 33 columns
Target
0 0
1 1
2 0
3 1
4 1
..
233149 0
233150 0
233151 0
233152 0
233153 0
Name: loan_default, Length: 225493, dtype: int64
Data_df = Data.copy()
from sklearn import preprocessing
Data_scaler = preprocessing.MinMaxScaler()
Data = Data_scaler.fit_transform(Data)
Data=pd.DataFrame(Data, columns=Data_df.columns)
Data
| disbursed_amount | asset_cost | Loan to Value of the asset | MobileNo_Avl_Flag | Aadhar_flag | PAN_flag | VoterID_flag | Driving_flag | Passport_flag | Bureau Score | Bureau score description | PRI.NO.OF.ACCTS | PRI.ACTIVE.ACCTS | PRI.OVERDUE.ACCTS | PRI.CURRENT.BALANCE | PRI.SANCTIONED.AMOUNT | PRI.DISBURSED.AMOUNT | SEC.NO.OF.ACCTS | SEC.ACTIVE.ACCTS | SEC.OVERDUE.ACCTS | SEC.CURRENT.BALANCE | SEC.SANCTIONED.AMOUNT | SEC.DISBURSED.AMOUNT | Primary_EMI | Secondary_EMI | NEW.ACCTS.IN.LAST.SIX.MONTHS | Default_Acct_in_6_months | Average_Tenure | CREDIT.HISTORY.LENGTH | NO.OF_INQUIRIES | Age | Employment.Type_Salaried | Employment.Type_Self employed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038251 | 0.016564 | 0.933129 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.318182 | 1.0 | 0.0 |
| 1 | 0.034727 | 0.022098 | 0.732883 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.671910 | 0.6 | 0.002208 | 0.006944 | 0.04 | 0.064978 | 0.000050 | 0.000050 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000078 | 0.0 | 0.0 | 0.05 | 0.062331 | 0.049145 | 0.000000 | 0.295455 | 0.0 | 1.0 |
| 2 | 0.041023 | 0.018855 | 0.934110 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.272727 | 0.0 | 1.0 |
| 3 | 0.045371 | 0.022534 | 0.920000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.342697 | 1.0 | 0.006623 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000001 | 0.0 | 0.0 | 0.00 | 0.021680 | 0.032051 | 0.027778 | 0.090909 | 0.0 | 1.0 |
| 4 | 0.040099 | 0.018035 | 0.918896 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.027778 | 0.477273 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 225488 | 0.051223 | 0.052947 | 0.579387 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.825843 | 0.2 | 0.008830 | 0.020833 | 0.00 | 0.068493 | 0.000416 | 0.000416 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000159 | 0.0 | 0.0 | 0.00 | 0.056911 | 0.083333 | 0.000000 | 0.227273 | 1.0 | 0.0 |
| 225489 | 0.061939 | 0.049228 | 0.753988 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.926966 | 0.2 | 0.002208 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000061 | 0.0 | 0.0 | 0.00 | 0.016260 | 0.012821 | 0.000000 | 0.227273 | 0.0 | 1.0 |
| 225490 | 0.020702 | 0.026481 | 0.428834 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | 1.0 | 0.0 |
| 225491 | 0.021497 | 0.028086 | 0.436810 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.090909 | 1.0 | 0.0 |
| 225492 | 0.064095 | 0.061155 | 0.654110 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.00 | 0.064710 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.015698 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.318182 | 1.0 | 0.0 |
225493 rows × 33 columns
The dataset Data consists of 33 features. Using all the descriptive features might lead to overfitting, long computation times, and poor performance. Feature selection is the practice of selecting the best features from the dataset so that the chosen machine learning models perform efficiently and effectively. Here, random forest feature importance is used to select the top 15 features for prediction.
from sklearn.ensemble import RandomForestClassifier
num_features = 15
model_rfi = RandomForestClassifier(n_estimators=100)
model_rfi.fit(Data, Target)
fs_indices_rfi = np.argsort(model_rfi.feature_importances_)[::-1][0:num_features]
best_features_rfi = Data.columns[fs_indices_rfi].values
best_features_rfi
array(['Loan to Value of the asset', 'asset_cost', 'disbursed_amount',
'Age', 'CREDIT.HISTORY.LENGTH', 'PRI.CURRENT.BALANCE',
'Bureau Score', 'Primary_EMI', 'Average_Tenure',
'PRI.DISBURSED.AMOUNT', 'PRI.SANCTIONED.AMOUNT', 'PRI.NO.OF.ACCTS',
'NO.OF_INQUIRIES', 'PRI.ACTIVE.ACCTS',
'NEW.ACCTS.IN.LAST.SIX.MONTHS'], dtype=object)
feature_importances_rfi = model_rfi.feature_importances_[fs_indices_rfi]
feature_importances_rfi
array([0.18761273, 0.18611774, 0.17569818, 0.11131995, 0.03391046,
0.03311617, 0.03247854, 0.03177459, 0.03140901, 0.03132836,
0.03130724, 0.02003184, 0.0141609 , 0.01159445, 0.00859522])
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
def plot_imp(best_features, scores, method_name):
plt.barh(best_features, scores)
plt.title(method_name + ' Feature Importances')
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()
plot_imp(best_features_rfi, feature_importances_rfi, 'Random Forest')
The figure above shows the 15 most important features for modeling.
The selected features are stored as a data frame, Data_best, which is used for prediction from here on.
Data_best=pd.DataFrame(Data, columns=['Loan to Value of the asset', 'asset_cost', 'disbursed_amount','Age', 'CREDIT.HISTORY.LENGTH',
'PRI.CURRENT.BALANCE','Bureau Score', 'Primary_EMI', 'Average_Tenure','PRI.DISBURSED.AMOUNT',
'PRI.SANCTIONED.AMOUNT', 'PRI.NO.OF.ACCTS','NO.OF_INQUIRIES', 'PRI.ACTIVE.ACCTS',
'NEW.ACCTS.IN.LAST.SIX.MONTHS'])
Data_best
| Loan to Value of the asset | asset_cost | disbursed_amount | Age | CREDIT.HISTORY.LENGTH | PRI.CURRENT.BALANCE | Bureau Score | Primary_EMI | Average_Tenure | PRI.DISBURSED.AMOUNT | PRI.SANCTIONED.AMOUNT | PRI.NO.OF.ACCTS | NO.OF_INQUIRIES | PRI.ACTIVE.ACCTS | NEW.ACCTS.IN.LAST.SIX.MONTHS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.933129 | 0.016564 | 0.038251 | 0.318182 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 1 | 0.732883 | 0.022098 | 0.034727 | 0.295455 | 0.049145 | 0.064978 | 0.671910 | 0.000078 | 0.062331 | 0.000050 | 0.000050 | 0.002208 | 0.000000 | 0.006944 | 0.0 |
| 2 | 0.934110 | 0.018855 | 0.041023 | 0.272727 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 3 | 0.920000 | 0.022534 | 0.045371 | 0.090909 | 0.032051 | 0.064710 | 0.342697 | 0.000001 | 0.021680 | 0.000000 | 0.000000 | 0.006623 | 0.027778 | 0.000000 | 0.0 |
| 4 | 0.918896 | 0.018035 | 0.040099 | 0.477273 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.027778 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 225488 | 0.579387 | 0.052947 | 0.051223 | 0.227273 | 0.083333 | 0.068493 | 0.825843 | 0.000159 | 0.056911 | 0.000416 | 0.000416 | 0.008830 | 0.000000 | 0.020833 | 0.0 |
| 225489 | 0.753988 | 0.049228 | 0.061939 | 0.227273 | 0.012821 | 0.064710 | 0.926966 | 0.000061 | 0.016260 | 0.000000 | 0.000000 | 0.002208 | 0.000000 | 0.000000 | 0.0 |
| 225490 | 0.428834 | 0.026481 | 0.020702 | 0.500000 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 225491 | 0.436810 | 0.028086 | 0.021497 | 0.090909 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 225492 | 0.654110 | 0.061155 | 0.064095 | 0.318182 | 0.000000 | 0.064710 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
225493 rows × 15 columns
The processed dataset consists of 225,493 rows. For this report, a random sample of 100,000 rows is considered. The data are separated into Data_sample (descriptive features) and target_sample (target feature).
Data_sample and target_sample are then split into training and test sets: 70% is used for training the models and 30% for testing.
n_samples = 100000
Data_sample = pd.DataFrame(Data_best).sample(n=n_samples, random_state=999).values
target_sample = pd.DataFrame(Target).sample(n=n_samples, random_state=999).values
print(Data_sample.shape)
print(target_sample.shape)
(100000, 15)
(100000, 1)
from sklearn.model_selection import train_test_split
Data_sample_train, Data_sample_test, \
target_sample_train, target_sample_test = train_test_split(Data_sample, target_sample,
test_size = 0.3, random_state=999,
stratify = target_sample)
print(Data_sample_train.shape)
print(Data_sample_test.shape)
(70000, 15)
(30000, 15)
from sklearn.model_selection import RepeatedStratifiedKFold
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=1,
random_state=999)
Two arguments are set for modeling K-Nearest Neighbour: n_neighbors and p.
Hyperparameter tuning is performed with different values of n_neighbors and p to select the best parameters for KNN.
The values considered are as follows:
import numpy as np
params_KNN = {'n_neighbors': [1,5,10,15,20,30],
'p': [1, 2]}
from sklearn.model_selection import GridSearchCV
gs_KNN = GridSearchCV(KNeighborsClassifier(),
param_grid=params_KNN,
cv=cv_method,
verbose=1,
scoring= 'roc_auc',
return_train_score=True)
gs_KNN.fit(Data_sample_train, target_sample_train)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=5, random_state=999),
estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': [1, 5, 10, 15, 20, 30], 'p': [1, 2]},
return_train_score=True, scoring='roc_auc', verbose=1)
gs_KNN.best_params_
{'n_neighbors': 30, 'p': 1}
best_auc = gs_KNN.best_score_
print('The best mean cross-validated AUC score is:', best_auc)
The best mean cross-validated AUC score is: 0.5907847401901392
The above output suggests that the best model uses 30 neighbours with the Manhattan distance (p = 1). The mean cross-validated AUC scores for all parameter combinations are:
gs_KNN.cv_results_['mean_test_score']
array([0.52024281, 0.51929357, 0.55019497, 0.54871563, 0.56466227,
0.56181297, 0.57608443, 0.57440565, 0.58430207, 0.58079322,
0.59078474, 0.58965405])
results_KNN = pd.DataFrame(gs_KNN.cv_results_['params'])
results_KNN['test_score'] = gs_KNN.cv_results_['mean_test_score']
results_KNN['metric'] =results_KNN['p'].replace([1,2], ["Manhattan", "Euclidean"])
results_KNN
| n_neighbors | p | test_score | metric | |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.520243 | Manhattan |
| 1 | 1 | 2 | 0.519294 | Euclidean |
| 2 | 5 | 1 | 0.550195 | Manhattan |
| 3 | 5 | 2 | 0.548716 | Euclidean |
| 4 | 10 | 1 | 0.564662 | Manhattan |
| 5 | 10 | 2 | 0.561813 | Euclidean |
| 6 | 15 | 1 | 0.576084 | Manhattan |
| 7 | 15 | 2 | 0.574406 | Euclidean |
| 8 | 20 | 1 | 0.584302 | Manhattan |
| 9 | 20 | 2 | 0.580793 | Euclidean |
| 10 | 30 | 1 | 0.590785 | Manhattan |
| 11 | 30 | 2 | 0.589654 | Euclidean |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
for i in ["Manhattan", "Euclidean"]:
temp = results_KNN[results_KNN['metric'] == i]
plt.plot(temp['n_neighbors'], temp['test_score'], marker = '.', label = i)
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel("Mean CV Score")
plt.title("KNN Hyperparameter Performance Comparison")
plt.show()
The graph above shows that the mean CV score rises as K increases for both metrics, with the Manhattan distance (p = 1) consistently scoring slightly above the Euclidean distance (p = 2).
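The p argument tuned above selects the order of the Minkowski distance used by `KNeighborsClassifier`: p = 1 is Manhattan distance, p = 2 is Euclidean. A minimal sketch with two made-up points (not from the dataset):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))  # Manhattan: |3| + |4| = 7.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```

With p = 1 the distance sums absolute coordinate differences; with p = 2 it is the familiar straight-line distance.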
knn1 = KNeighborsClassifier(n_neighbors=30 , p=1 )
knn= knn1.fit(Data_sample_train, target_sample_train)
knn.score(Data_sample_train, target_sample_train)
0.7844
pred_KNN = knn.predict(Data_sample_test)
from sklearn.metrics import accuracy_score
ac_KNN = accuracy_score (target_sample_test, pred_KNN)
ac_KNN
0.7834333333333333
from sklearn.metrics import confusion_matrix
print("\nConfusion matrix for K-Nearest Neighbor")
cm=confusion_matrix(target_sample_test, pred_KNN)
cm
Confusion matrix for K-Nearest Neighbor
array([[23471, 54],
[ 6443, 32]])
import seaborn as sb
cmap = sb.cubehelix_palette(50, hue=0.05, rot=0, light=0.5, dark=0, as_cmap=False)
sb.heatmap(cm, cmap=cmap, xticklabels=['Predicted Negative', 'Predicted Positive'],
           yticklabels=['Actual Negative', 'Actual Positive'], annot=True, fmt='d')
<AxesSubplot:>
from sklearn import metrics
print("\nK-Nearest Neighbor - Classification report")
print(metrics.classification_report(target_sample_test, pred_KNN))
K-Nearest Neighbor - Classification report
precision recall f1-score support
0 0.78 1.00 0.88 23525
1 0.37 0.00 0.01 6475
accuracy 0.78 30000
macro avg 0.58 0.50 0.44 30000
weighted avg 0.70 0.78 0.69 30000
metrics.roc_auc_score(target_sample_test, pred_KNN)
0.5013233272744431
t_prob = gs_KNN.predict_proba(Data_sample_test)
t_prob[0:10]
array([[0.83333333, 0.16666667],
[0.76666667, 0.23333333],
[0.66666667, 0.33333333],
[0.6 , 0.4 ],
[0.9 , 0.1 ],
[0.9 , 0.1 ],
[0.86666667, 0.13333333],
[0.83333333, 0.16666667],
[0.96666667, 0.03333333],
[1. , 0. ]])
fpr, tpr, _ = metrics.roc_curve(target_sample_test, t_prob[:, 1])
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.5902613124130658
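Note the gap between the earlier `roc_auc_score(target_sample_test, pred_KNN)` of about 0.50, computed on hard 0/1 predictions, and the 0.59 obtained here from predicted probabilities: AUC measures ranking quality, so thresholded labels discard most of the information. A toy sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.2, 0.6, 0.7, 0.9]   # hypothetical class-1 probabilities

# Ranking scores: AUC rewards placing positives above negatives.
auc_probs = roc_auc_score(y_true, y_score)

# Thresholded 0/1 labels collapse the ranking, flattening the AUC.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]   # -> [0, 1, 1, 1]
auc_labels = roc_auc_score(y_true, y_pred)

print(auc_probs, auc_labels)   # 1.0 0.75
```

This is why the ROC curves below are built from `predict_proba` outputs rather than from class predictions.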
import pandas as pd
dfKNN = pd.DataFrame({'FPR': fpr, 'TPR': tpr})
dfKNN.head()
| FPR | TPR | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000128 | 0.000000 |
| 2 | 0.000340 | 0.000463 |
| 3 | 0.001020 | 0.001853 |
| 4 | 0.002295 | 0.004942 |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
ax = dfKNN.plot.line(x='FPR', y='TPR', title='ROC Curve', legend=False, marker = '.')
plt.plot([0, 1], [0, 1], '--')
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
plt.show();
To model the decision tree, three arguments are set: criterion, max_depth, and min_samples_split.
Hyperparameter tuning is performed with different values of criterion, max_depth, and min_samples_split to select the best parameters for the decision tree.
The values considered are as follows:
from sklearn.tree import DecisionTreeClassifier
df_classifier = DecisionTreeClassifier(random_state=999)
params_DT = {'criterion': ['gini', 'entropy'],
'max_depth': [2,4,6,8,10],
'min_samples_split': [2,5,10]}
gs_DT = GridSearchCV(estimator=df_classifier,
param_grid=params_DT,
cv=cv_method,
verbose=1,
scoring='roc_auc')
gs_DT.fit(Data_sample_train, target_sample_train);
Fitting 5 folds for each of 30 candidates, totalling 150 fits
gs_DT.best_params_
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_split': 2}
gs_DT.best_score_
0.6126022182183725
The above output of hyperparameter tuning suggests that the best parameters for the decision tree are criterion='entropy', max_depth=6, and min_samples_split=2. The full grid of results is shown below:
results_DT = pd.DataFrame(gs_DT.cv_results_['params'])
results_DT['test_score'] = gs_DT.cv_results_['mean_test_score']
results_DT.columns
results_DT
| criterion | max_depth | min_samples_split | test_score | |
|---|---|---|---|---|
| 0 | gini | 2 | 2 | 0.588240 |
| 1 | gini | 2 | 5 | 0.588240 |
| 2 | gini | 2 | 10 | 0.588240 |
| 3 | gini | 4 | 2 | 0.606893 |
| 4 | gini | 4 | 5 | 0.606893 |
| 5 | gini | 4 | 10 | 0.606893 |
| 6 | gini | 6 | 2 | 0.612206 |
| 7 | gini | 6 | 5 | 0.612206 |
| 8 | gini | 6 | 10 | 0.612165 |
| 9 | gini | 8 | 2 | 0.608861 |
| 10 | gini | 8 | 5 | 0.609251 |
| 11 | gini | 8 | 10 | 0.609189 |
| 12 | gini | 10 | 2 | 0.600430 |
| 13 | gini | 10 | 5 | 0.600262 |
| 14 | gini | 10 | 10 | 0.600496 |
| 15 | entropy | 2 | 2 | 0.585060 |
| 16 | entropy | 2 | 5 | 0.585060 |
| 17 | entropy | 2 | 10 | 0.585060 |
| 18 | entropy | 4 | 2 | 0.605970 |
| 19 | entropy | 4 | 5 | 0.605970 |
| 20 | entropy | 4 | 10 | 0.605970 |
| 21 | entropy | 6 | 2 | 0.612602 |
| 22 | entropy | 6 | 5 | 0.612602 |
| 23 | entropy | 6 | 10 | 0.612602 |
| 24 | entropy | 8 | 2 | 0.610542 |
| 25 | entropy | 8 | 5 | 0.610331 |
| 26 | entropy | 8 | 10 | 0.610329 |
| 27 | entropy | 10 | 2 | 0.605056 |
| 28 | entropy | 10 | 5 | 0.604557 |
| 29 | entropy | 10 | 10 | 0.604165 |
for i in ['gini', 'entropy']:
temp = results_DT[results_DT['criterion'] == i]
temp_average = temp.groupby('max_depth').agg({'test_score': 'mean'})
plt.plot(temp_average, marker = '.', label = i)
plt.legend()
plt.xlabel('Max Depth')
plt.ylabel("Mean CV Score")
plt.title("DT Performance Comparison")
plt.show()
The figure above shows the mean CV score at various max-depth values. The red line represents the Gini index criterion and the blue line represents the entropy criterion.
As can be seen in the figure, the Gini line sits above the entropy line up to a max depth of about 5, after which the entropy line continues to rise and reaches the maximum mean CV score at a max depth of 6.
for i in results_DT['max_depth'].unique():
temp = results_DT[results_DT['max_depth'] == i]
plt.plot(temp['min_samples_split'], temp['test_score'], marker = '.', label = i)
plt.legend(title = "Max Depth")
plt.xlabel('Min Samples for Split')
plt.ylabel("AUC Score")
plt.title("DT Performance Comparison with 15 Features")
plt.show()
The final model is developed with the parameters derived from hyperparameter tuning:
DT1= DecisionTreeClassifier(criterion='entropy',max_depth=6,min_samples_split=2, random_state=999)
DT = DT1.fit(Data_sample_train, target_sample_train)
DT.score(Data_sample_train, target_sample_train)
0.7845428571428571
pred_DT = DT.predict(Data_sample_test)
from sklearn.metrics import accuracy_score
ac_DT = accuracy_score (target_sample_test, pred_DT)
ac_DT
0.7838333333333334
print("\nConfusion matrix for Decision Tree")
print(metrics.confusion_matrix(target_sample_test, pred_DT))
cm_DT=confusion_matrix(target_sample_test, pred_DT)
Confusion matrix for Decision Tree
[[23503    22]
 [ 6463    12]]
cmap = sb.cubehelix_palette(50, hue=0.05, rot=0, light=0.5, dark=0, as_cmap=False)
sb.heatmap(cm_DT, cmap=cmap, xticklabels=['Predicted Negative', 'Predicted Positive'],
           yticklabels=['Actual Negative', 'Actual Positive'], annot=True, fmt='d')
<AxesSubplot:>
print("\nClassification report for Decision Tree")
print(metrics.classification_report(target_sample_test, pred_DT))
Classification report for Decision Tree
precision recall f1-score support
0 0.78 1.00 0.88 23525
1 0.35 0.00 0.00 6475
accuracy 0.78 30000
macro avg 0.57 0.50 0.44 30000
weighted avg 0.69 0.78 0.69 30000
metrics.roc_auc_score(target_sample_test, pred_DT)
0.5004590532539523
t_probDT = gs_DT.predict_proba(Data_sample_test)
t_probDT[0:10]
array([[0.84345282, 0.15654718],
[0.75415707, 0.24584293],
[0.80792683, 0.19207317],
[0.70774772, 0.29225228],
[0.87048377, 0.12951623],
[0.87048377, 0.12951623],
[0.76291226, 0.23708774],
[0.81085714, 0.18914286],
[0.90508021, 0.09491979],
[0.94153846, 0.05846154]])
fpr, tpr, _ = metrics.roc_curve(target_sample_test, t_probDT[:, 1])
roc_auc_DT = metrics.auc(fpr, tpr)
roc_auc_DT
0.6070092524587742
import pandas as pd
dfDT = pd.DataFrame({'FPR': fpr, 'TPR': tpr})
dfDT.head()
| FPR | TPR | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000128 | 0.000000 |
| 2 | 0.000468 | 0.000154 |
| 3 | 0.000638 | 0.001236 |
| 4 | 0.000723 | 0.001390 |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
ax = dfDT.plot.line(x='FPR', y='TPR', title='ROC Curve', legend=False, marker = '.')
plt.plot([0, 1], [0, 1], '--')
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
plt.show();
from sklearn import tree
plt.figure(figsize= (200,200))
tree.plot_tree(DT,fontsize=7, filled = True)
plt.savefig('tree_high_dpi', dpi=100)
A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks; Bayes' theorem is at its heart.
For each class, Naive Bayes computes a membership probability, i.e., the likelihood that a given record or data point belongs to that class. The predicted class is the most probable one, that is, the class with the highest posterior probability.
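As an illustration of the Bayes rule underlying the classifier, here is a toy sketch with made-up prior and likelihood values (not taken from the dataset):

```python
# Hypothetical two-class example: a 20% prior default rate, and a single
# binary feature "missed a payment before" with class-conditional likelihoods.
prior = {'default': 0.2, 'no_default': 0.8}
likelihood = {'default': 0.7, 'no_default': 0.1}   # P(feature=1 | class)

# Bayes' theorem: posterior is proportional to likelihood x prior.
joint = {c: likelihood[c] * prior[c] for c in prior}
evidence = sum(joint.values())                      # P(feature=1)
posterior = {c: joint[c] / evidence for c in joint}

print(posterior)  # default ~0.636, no_default ~0.364 -> predict "default"
```

Gaussian Naive Bayes applies the same rule, but models each numerical feature's class-conditional likelihood as a normal distribution.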
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import RandomizedSearchCV
np.random.seed(999)
nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(1,-3, num=100)}
n_iter_search = 20
gs_NB = RandomizedSearchCV(estimator=nb_classifier,
param_distributions=params_NB,
cv=cv_method,
verbose=1,
n_iter=n_iter_search,
scoring='roc_auc',
n_jobs=-2)
Data_transformed = PowerTransformer().fit_transform(Data_sample_train)
gs_NB.fit(Data_transformed,target_sample_train);
Fitting 5 folds for each of 20 candidates, totalling 100 fits
gs_NB.best_params_
{'var_smoothing': 0.004430621457583878}
gs_NB.best_score_
0.5740308691787148
results_NB = pd.DataFrame(gs_NB.cv_results_['params'])
results_NB['test_score'] = gs_NB.cv_results_['mean_test_score']
results_NB
| var_smoothing | test_score | |
|---|---|---|
| 0 | 10.000000 | 0.573323 |
| 1 | 0.017886 | 0.574029 |
| 2 | 0.673415 | 0.573817 |
| 3 | 0.351119 | 0.573916 |
| 4 | 1.707353 | 0.573583 |
| 5 | 0.890215 | 0.573754 |
| 6 | 0.010235 | 0.574030 |
| 7 | 0.265609 | 0.573948 |
| 8 | 0.045349 | 0.574021 |
| 9 | 1.873817 | 0.573563 |
| 10 | 3.944206 | 0.573420 |
| 11 | 0.054623 | 0.574017 |
| 12 | 0.291505 | 0.573938 |
| 13 | 0.008498 | 0.574031 |
| 14 | 0.151991 | 0.573983 |
| 15 | 0.019630 | 0.574029 |
| 16 | 8.302176 | 0.573337 |
| 17 | 0.006428 | 0.574031 |
| 18 | 0.739072 | 0.573797 |
| 19 | 0.004431 | 0.574031 |
plt.plot(results_NB['var_smoothing'], results_NB['test_score'], marker = '.')
plt.xlabel('Var. Smoothing')
plt.ylabel("Mean CV Score")
plt.title("NB Performance Comparison")
plt.show()
NB1= GaussianNB(var_smoothing= 0.004430621457583878)
NB = NB1.fit(Data_transformed, target_sample_train)
# Note: ideally the PowerTransformer fitted on the training data would be
# reused to transform the test data, rather than refitting on the test set.
Data_test_transformed = PowerTransformer().fit_transform(Data_sample_test)
pred_NB = NB.predict(Data_test_transformed)
from sklearn.metrics import accuracy_score
ac_NB = accuracy_score (target_sample_test, pred_NB)
ac_NB
0.7147333333333333
print("\nConfusion matrix for Naive Bayes")
print(metrics.confusion_matrix(target_sample_test, pred_NB))
cm_NB = metrics.confusion_matrix(target_sample_test, pred_NB)
Confusion matrix for Naive Bayes
[[20033  3492]
 [ 5066  1409]]
cmap = sb.cubehelix_palette(50, hue=0.05, rot=0, light=0.5, dark=0, as_cmap=False)
sb.heatmap(cm_NB, cmap=cmap, xticklabels=['Predicted Negative', 'Predicted Positive'],
           yticklabels=['Actual Negative', 'Actual Positive'], annot=True, fmt='d')
<AxesSubplot:>
print("\nClassification report for Naive Bayes")
print(metrics.classification_report(target_sample_test, pred_NB))
Classification report for Naive Bayes
precision recall f1-score support
0 0.80 0.85 0.82 23525
1 0.29 0.22 0.25 6475
accuracy 0.71 30000
macro avg 0.54 0.53 0.54 30000
weighted avg 0.69 0.71 0.70 30000
metrics.roc_auc_score(target_sample_test, pred_NB)
0.5345841727563301
t_probNB = gs_NB.predict_proba(Data_sample_test)
t_probNB[0:10]
array([[0.62417836, 0.37582164],
[0.58186943, 0.41813057],
[0.5849428 , 0.4150572 ],
[0.57751315, 0.42248685],
[0.6060713 , 0.3939287 ],
[0.59107239, 0.40892761],
[0.62177917, 0.37822083],
[0.58339369, 0.41660631],
[0.62549282, 0.37450718],
[0.62834321, 0.37165679]])
fprNB, tprNB, _ = metrics.roc_curve(target_sample_test, t_probNB[:, 1])
roc_aucNB = metrics.auc(fprNB, tprNB)
roc_aucNB
0.5719794977002204
import pandas as pd
dfNB = pd.DataFrame({'fpr': fprNB, 'tpr': tprNB})
dfNB.head()
| fpr | tpr | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000043 | 0.000000 |
| 2 | 0.000340 | 0.000000 |
| 3 | 0.000340 | 0.000154 |
| 4 | 0.000468 | 0.000154 |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
ax = dfNB.plot.line(x='fpr', y='tpr', title='ROC Curve', legend=False, marker = '.')
plt.plot([0, 1], [0, 1], '--')
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
plt.show();
With bagging (bootstrap aggregating), each model in the ensemble is trained on a random sample of the dataset drawn with replacement, known as a bootstrap sample.
Bagging is an ensemble meta-algorithm for increasing the accuracy and stability of machine learning algorithms used in statistical classification and regression.
Bagging and boosting are the two primary types of ensemble machine learning. Applied to decision trees, bagging increases model stability by reducing variance, which helps avoid overfitting.
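The bootstrap sampling at the core of bagging can be sketched as follows (toy index array, not the project data):

```python
import numpy as np

rng = np.random.default_rng(999)
data = np.arange(10)                      # toy "dataset" of 10 row indices

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so some rows repeat while others are left out ("out-of-bag").
boot = rng.choice(data, size=len(data), replace=True)
oob = np.setdiff1d(data, boot)

print(boot)   # duplicates expected
print(oob)    # rows never drawn; usable for out-of-bag validation
```

`BaggingClassifier` repeats this draw once per base estimator and averages (votes over) the resulting models.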
We determine the optimal number of estimators:
from sklearn.ensemble import BaggingClassifier
BG=BaggingClassifier()
params_BG={'n_estimators':[150,200,300]}
gs_BG=GridSearchCV(estimator = BG,
param_grid=params_BG,
cv=cv_method,
scoring='roc_auc',
n_jobs=-2,
verbose=1)
gs_BG.fit(Data_sample_train, target_sample_train);
Fitting 5 folds for each of 3 candidates, totalling 15 fits
gs_BG.best_params_
{'n_estimators': 300}
gs_BG.best_score_
0.5868791355323
results_BG = pd.DataFrame(gs_BG.cv_results_['params'])
results_BG['test_score'] = gs_BG.cv_results_['mean_test_score']
results_BG
| n_estimators | test_score | |
|---|---|---|
| 0 | 150 | 0.584591 |
| 1 | 200 | 0.584711 |
| 2 | 300 | 0.586879 |
plt.plot(results_BG['n_estimators'], results_BG['test_score'], marker = '.')
plt.xlabel('n_estimators')
plt.ylabel("AUC Score")
plt.title("Bagging Performance Comparison with 15 Features")
plt.show()
BG1= BaggingClassifier(n_estimators = 300)
BG= BG1.fit(Data_sample_train, target_sample_train)
pred_BG=BG.predict(Data_sample_test)
from sklearn.metrics import accuracy_score
ac_BG = accuracy_score (target_sample_test, pred_BG)
ac_BG
0.7710333333333333
print("\nConfusion matrix for Bagging")
print(metrics.confusion_matrix(target_sample_test, pred_BG))
cm_BG= metrics.confusion_matrix(target_sample_test, pred_BG)
Confusion matrix for Bagging
[[22796   729]
 [ 6140   335]]
cmap = sb.cubehelix_palette(50, hue=0.05, rot=0, light=0.5, dark=0, as_cmap=False)
sb.heatmap(cm_BG, cmap=cmap, xticklabels=['Predicted Negative', 'Predicted Positive'],
           yticklabels=['Actual Negative', 'Actual Positive'], annot=True, fmt='d')
<AxesSubplot:>
print("\nClassification report for Bagging")
print(metrics.classification_report(target_sample_test, pred_BG))
Classification report for Bagging
precision recall f1-score support
0 0.79 0.97 0.87 23525
1 0.31 0.05 0.09 6475
accuracy 0.77 30000
macro avg 0.55 0.51 0.48 30000
weighted avg 0.69 0.77 0.70 30000
metrics.roc_auc_score(target_sample_test, pred_BG)
0.5103745707146344
t_probBG = gs_BG.predict_proba(Data_sample_test)
t_probBG[0:10]
array([[0.74666667, 0.25333333],
[0.86 , 0.14 ],
[0.85333333, 0.14666667],
[0.71666667, 0.28333333],
[0.72 , 0.28 ],
[0.95 , 0.05 ],
[0.78333333, 0.21666667],
[0.66666667, 0.33333333],
[0.87666667, 0.12333333],
[0.95333333, 0.04666667]])
fprBG, tprBG, _ = metrics.roc_curve(target_sample_test, t_probBG[:, 1])
roc_aucBG = metrics.auc(fprBG, tprBG)
roc_aucBG
0.5800689712332646
import pandas as pd
dfBG = pd.DataFrame({'FPR': fprBG, 'TPR': tprBG})
dfBG.head()
| FPR | TPR | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000000 | 0.000154 |
| 2 | 0.000000 | 0.000309 |
| 3 | 0.000128 | 0.000309 |
| 4 | 0.000128 | 0.000463 |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
ax = dfBG.plot.line(x='FPR', y='TPR', title='ROC Curve', legend=False, marker = '.')
plt.plot([0, 1], [0, 1], '--')
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
plt.show();
To model the Random Forest, two arguments are set: criterion and n_estimators.
Hyperparameter tuning is performed with different values of criterion and n_estimators to select the best parameters for the Random Forest.
The values considered are as follows:
from sklearn.ensemble import RandomForestClassifier
RF_classifier = RandomForestClassifier()
params_RF = {'criterion': ['gini', 'entropy'],
'n_estimators':[100,200,300]}
gs_RF = GridSearchCV(estimator=RF_classifier,
param_grid=params_RF,
cv=cv_method,
verbose=1,
scoring='roc_auc')
gs_RF.fit(Data_sample_train, target_sample_train);
Fitting 5 folds for each of 6 candidates, totalling 30 fits
gs_RF.best_params_
{'criterion': 'entropy', 'n_estimators': 200}
gs_RF.best_score_
0.5945758783380247
The above output after tuning suggests the optimum parameters are criterion='entropy' and n_estimators=200, with a best score of 0.5945758783380247.
results_RF = pd.DataFrame(gs_RF.cv_results_['params'])
results_RF['test_score'] = gs_RF.cv_results_['mean_test_score']
results_RF
| criterion | n_estimators | test_score | |
|---|---|---|---|
| 0 | gini | 100 | 0.589789 |
| 1 | gini | 200 | 0.590979 |
| 2 | gini | 300 | 0.592621 |
| 3 | entropy | 100 | 0.590072 |
| 4 | entropy | 200 | 0.594576 |
| 5 | entropy | 300 | 0.594157 |
plt.plot(results_RF['criterion'], results_RF['test_score'], marker = '.')
plt.xlabel('criterion')
plt.ylabel("AUC Score")
plt.title("RF Performance Comparison with 15 Features")
plt.show()
for i in ['gini', 'entropy']:
temp = results_RF[results_RF['criterion'] == i]
temp_average = temp.groupby('n_estimators').agg({'test_score': 'mean'})
plt.plot(temp_average, marker = '.', label = i)
plt.legend()
plt.xlabel('Estimators')
plt.ylabel("Mean CV Score")
plt.title("RF Performance Comparison")
plt.show()
The figure above shows the mean CV score at various numbers of estimators. The red line represents the Gini index criterion and the blue line represents the entropy criterion.
As can be seen in the figure, the entropy line rises above the Gini index line as the number of estimators increases, giving a consistently higher mean CV score than the Gini criterion.
RF1 = RandomForestClassifier(criterion='entropy', n_estimators=200)  # best parameters from the grid search
RF= RF1.fit(Data_sample_train, target_sample_train)
pred_RF=gs_RF.predict(Data_sample_test)
from sklearn.metrics import accuracy_score
ac_RF = accuracy_score (target_sample_test, pred_RF)
ac_RF
0.7743666666666666
print("\nConfusion matrix for RF")
print(metrics.confusion_matrix(target_sample_test, pred_RF))
cmRF = metrics.confusion_matrix(target_sample_test, pred_RF)
Confusion matrix for RF
[[23006   519]
 [ 6250   225]]
cmap = sb.cubehelix_palette(n_colors=10, hue=0.05, rot=0, light=0.5, dark=0, as_cmap=False)
sb.heatmap(cmRF, cmap=cmap, xticklabels=['Predicted Negative', 'Predicted Positive'],
           yticklabels=['Actual Negative', 'Actual Positive'], annot=True, fmt='d')
<AxesSubplot:>
print("\nClassification report for RF")
print(metrics.classification_report(target_sample_test, pred_RF))
Classification report for RF
precision recall f1-score support
0 0.79 0.98 0.87 23525
1 0.30 0.03 0.06 6475
accuracy 0.77 30000
macro avg 0.54 0.51 0.47 30000
weighted avg 0.68 0.77 0.70 30000
metrics.roc_auc_score(target_sample_test, pred_RF)
0.5063436990960901
t_probRF = gs_RF.predict_proba(Data_sample_test)
t_probRF[0:10]
array([[0.84 , 0.16 ],
[0.735, 0.265],
[0.8 , 0.2 ],
[0.7 , 0.3 ],
[0.77 , 0.23 ],
[0.96 , 0.04 ],
[0.755, 0.245],
[0.725, 0.275],
[0.87 , 0.13 ],
[0.93 , 0.07 ]])
fprRF, tprRF, _ = metrics.roc_curve(target_sample_test, t_probRF[:, 1])
roc_aucRF = metrics.auc(fprRF, tprRF)
roc_aucRF
0.5869048798000976
import pandas as pd
dfRF = pd.DataFrame({'fpr': fprRF, 'tpr': tprRF})
dfRF.head()
| fpr | tpr | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000000 | 0.000154 |
| 2 | 0.000000 | 0.000618 |
| 3 | 0.000085 | 0.000618 |
| 4 | 0.000128 | 0.000772 |
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
ax = dfRF.plot.line(x='fpr', y='tpr', title='ROC Curve', legend=False, marker = '.')
plt.plot([0, 1], [0, 1], '--')
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
plt.show();
r_probs = [0 for _ in range(len(target_sample_test))]
rf_probs = gs_RF.predict_proba(Data_sample_test)
nb_probs = gs_NB.predict_proba(Data_sample_test)
knn_probs= gs_KNN.predict_proba(Data_sample_test)
bg_probs =gs_BG.predict_proba(Data_sample_test)
dt_probs =gs_DT.predict_proba(Data_sample_test)
from sklearn.metrics import roc_curve, roc_auc_score
rf_probs = rf_probs[:, 1]
nb_probs = nb_probs[:, 1]
knn_probs= knn_probs[:, 1]
bg_probs = bg_probs[:, 1]
dt_probs = dt_probs[:, 1]
r_auc = roc_auc_score(target_sample_test, r_probs)
rf_auc = roc_auc_score(target_sample_test, rf_probs)
nb_auc = roc_auc_score(target_sample_test, nb_probs)
knn_auc = roc_auc_score(target_sample_test, knn_probs)
bg_auc = roc_auc_score(target_sample_test, bg_probs)
dt_auc = roc_auc_score(target_sample_test, dt_probs)
print('Random (chance) Prediction: AUROC = %.3f' % (r_auc))
print('Random Forest: AUROC = %.3f' % (rf_auc))
print('KNN: AUROC = %.3f' % (knn_auc))
print('Bagging: AUROC = %.3f' % (bg_auc))
print('Decision Tree: AUROC = %.3f' % (dt_auc))
print('Naive Bayes: AUROC = %.3f' % (nb_auc))
Random (chance) Prediction: AUROC = 0.500
Random Forest: AUROC = 0.587
KNN: AUROC = 0.590
Bagging: AUROC = 0.580
Decision Tree: AUROC = 0.607
Naive Bayes: AUROC = 0.572
r_fpr, r_tpr, _ = roc_curve(target_sample_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(target_sample_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(target_sample_test,nb_probs )
knn_fpr, knn_tpr, _ = roc_curve(target_sample_test,knn_probs )
bg_fpr, bg_tpr, _ = roc_curve(target_sample_test,bg_probs )
dt_fpr, dt_tpr, _ = roc_curve(target_sample_test,dt_probs )
plt.plot(r_fpr, r_tpr, linestyle='--', label='Random prediction (AUROC = %0.3f)' % r_auc)
plt.plot(rf_fpr, rf_tpr, marker='.', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(nb_fpr, nb_tpr, marker='.', label='Naive Bayes (AUROC = %0.3f)' % nb_auc)
plt.plot(knn_fpr, knn_tpr, marker='.', label='KNN (AUROC = %0.3f)' % knn_auc)
plt.plot(bg_fpr, bg_tpr, marker='.', label='Bagging (AUROC = %0.3f)' % bg_auc)
plt.plot(dt_fpr, dt_tpr, marker='.', label='Decision Tree (AUROC = %0.3f)' % dt_auc)
plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
The figure above displays the ROC curve and AUC score of each selected predictive model. The ROC curve of every model sits only just above the random-prediction line; among the selected models, the decision tree performs best with an AUC score of 0.607.
ac_KNN = accuracy_score (target_sample_test, pred_KNN)
ac_DT = accuracy_score (target_sample_test, pred_DT)
ac_NB = accuracy_score (target_sample_test, pred_NB)
ac_BG = accuracy_score (target_sample_test, pred_BG)
ac_RF = accuracy_score (target_sample_test, pred_RF)
print("Accuracy score of KNN Model: %0.2f" %(100*ac_KNN) + '%')
print("Accuracy score of Decision Tree Model: %0.2f" %(100*ac_DT) + '%')
print("Accuracy score of Naive Bayes Model: %0.2f" %(100*ac_NB) + '%')
print("Accuracy score of Bagging Model: %0.2f" %(100*ac_BG) + '%')
print("Accuracy score of Random Forest Model: %0.2f" %(100*ac_RF) + '%')
Accuracy score of KNN Model: 78.34%
Accuracy score of Decision Tree Model: 78.38%
Accuracy score of Naive Bayes Model: 71.47%
Accuracy score of Bagging Model: 77.10%
Accuracy score of Random Forest Model: 77.44%
The above output suggests that the decision tree scored best of all the models with about 78.38% accuracy, followed closely by the KNN model at 78.34% and the Random Forest model at 77.44%.
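These accuracies should be read against the class imbalance: since 23,525 of the 30,000 test rows are non-defaulters (the class-0 support in the reports above), a trivial model that always predicts "no default" already reaches roughly the same accuracy:

```python
# Majority-class baseline: always predict "no default" (class 0).
n_negative = 23525   # class-0 support in the test set
n_total = 30000

baseline_accuracy = n_negative / n_total
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # 78.42%
```

So accuracy alone barely separates the trained models from a do-nothing baseline, which is why the class-1 metrics below matter more.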
We use the following metrics to measure the performance of the models with the Test dataset:
print("\nClassification report for KNN")
print(metrics.classification_report(target_sample_test, pred_KNN))
print("\nClassification report for DT")
print(metrics.classification_report(target_sample_test, pred_DT))
print("\nClassification report for NB")
print(metrics.classification_report(target_sample_test, pred_NB))
print("\nClassification report for BG")
print(metrics.classification_report(target_sample_test, pred_BG))
print("\nClassification report for RF")
print(metrics.classification_report(target_sample_test, pred_RF))
Classification report for KNN
precision recall f1-score support
0 0.78 1.00 0.88 23525
1 0.37 0.00 0.01 6475
accuracy 0.78 30000
macro avg 0.58 0.50 0.44 30000
weighted avg 0.70 0.78 0.69 30000
Classification report for DT
precision recall f1-score support
0 0.78 1.00 0.88 23525
1 0.35 0.00 0.00 6475
accuracy 0.78 30000
macro avg 0.57 0.50 0.44 30000
weighted avg 0.69 0.78 0.69 30000
Classification report for NB
precision recall f1-score support
0 0.80 0.85 0.82 23525
1 0.29 0.22 0.25 6475
accuracy 0.71 30000
macro avg 0.54 0.53 0.54 30000
weighted avg 0.69 0.71 0.70 30000
Classification report for BG
precision recall f1-score support
0 0.79 0.97 0.87 23525
1 0.31 0.05 0.09 6475
accuracy 0.77 30000
macro avg 0.55 0.51 0.48 30000
weighted avg 0.69 0.77 0.70 30000
Classification report for RF
precision recall f1-score support
0 0.79 0.98 0.87 23525
1 0.30 0.03 0.06 6475
accuracy 0.77 30000
macro avg 0.54 0.51 0.47 30000
weighted avg 0.68 0.77 0.70 30000
For the majority class (0), all models reach an F1 score of roughly 0.82 to 0.88; for the defaulter class (1), however, the F1 scores are very low, showing that the balance between precision and recall breaks down on the minority class.
Here we will compare the models on the basis of loan default prediction (prediction of '1'). Precisely, the comparison will be performed on the basis of Precision, Recall value, F1-Score,overall accuracy of the model and overall ROC-AUC curve score.
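As a sanity check, the class-1 metrics can be derived directly from a confusion matrix; for example, using the KNN matrix reported earlier (TN = 23471, FP = 54, FN = 6443, TP = 32):

```python
tn, fp, fn, tp = 23471, 54, 6443, 32   # KNN confusion matrix from above

precision = tp / (tp + fp)             # 32 / 86   ~ 0.372
recall = tp / (tp + fn)                # 32 / 6475 ~ 0.0049
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

These values match the KNN row of the comparison table below (after the ×100 scaling applied there).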
# Note: this DataFrame shadows the sklearn `metrics` module imported earlier.
metrics = pd.DataFrame (index= ['KNN','DT', 'NB', 'BG', 'RF'],
columns= ['accuracy', 'precision', 'recall', 'ROC-AUC', 'f1_score'])
metrics.loc['KNN', 'accuracy'] = accuracy_score (target_sample_test, pred_KNN)
metrics.loc['DT', 'accuracy'] = accuracy_score (target_sample_test, pred_DT)
metrics.loc['NB' , 'accuracy'] = accuracy_score (target_sample_test, pred_NB)
metrics.loc['BG', 'accuracy'] = accuracy_score (target_sample_test, pred_BG)
metrics.loc['RF', 'accuracy'] = accuracy_score (target_sample_test, pred_RF)
from sklearn.metrics import precision_score
metrics.loc['KNN', 'precision']= precision_score(y_true = target_sample_test, y_pred = pred_KNN)
metrics.loc['DT','precision']= precision_score(y_true = target_sample_test, y_pred = pred_DT)
metrics.loc['NB' ,'precision']= precision_score(y_true = target_sample_test, y_pred = pred_NB)
metrics.loc['BG', 'precision']= precision_score(y_true = target_sample_test, y_pred = pred_BG)
metrics.loc['RF', 'precision']= precision_score(y_true = target_sample_test, y_pred = pred_RF)
from sklearn.metrics import recall_score
metrics.loc['KNN' ,'recall']= recall_score(y_true = target_sample_test, y_pred = pred_KNN)
metrics.loc['DT' , 'recall']= recall_score(y_true = target_sample_test, y_pred = pred_DT)
metrics.loc['NB' , 'recall']= recall_score(y_true = target_sample_test, y_pred = pred_NB)
metrics.loc['BG', 'recall']= recall_score(y_true = target_sample_test, y_pred = pred_BG)
metrics.loc['RF', 'recall']= recall_score(y_true = target_sample_test, y_pred = pred_RF)
metrics.loc['KNN', 'ROC-AUC']= roc_auc_score(target_sample_test, knn_probs)
metrics.loc['DT', 'ROC-AUC']= roc_auc_score(target_sample_test, dt_probs)
metrics.loc['NB', 'ROC-AUC']= roc_auc_score(target_sample_test, nb_probs)
metrics.loc['BG', 'ROC-AUC']= roc_auc_score(target_sample_test, bg_probs)
metrics.loc['RF', 'ROC-AUC']= roc_auc_score(target_sample_test, rf_probs)
from sklearn.metrics import f1_score
metrics.loc['KNN' ,'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_KNN)
metrics.loc['DT' , 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_DT)
metrics.loc['NB' , 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_NB)
metrics.loc['BG', 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_BG)
metrics.loc['RF', 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_RF)
metrics = metrics*100
metrics
| accuracy | precision | recall | ROC-AUC | f1_score | |
|---|---|---|---|---|---|
| KNN | 78.343333 | 37.209302 | 0.494208 | 59.026131 | 0.975461 |
| DT | 78.383333 | 35.294118 | 0.185328 | 60.700925 | 0.36872 |
| NB | 71.473333 | 28.749235 | 21.760618 | 57.19795 | 24.771449 |
| BG | 77.103333 | 31.484962 | 5.173745 | 58.006897 | 8.88712 |
| RF | 77.436667 | 30.241935 | 3.474903 | 58.690488 | 6.23355 |
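The block of per-model `metrics.loc[...]` assignments above can be written more compactly by looping over a dictionary of predictions. A minimal sketch of that pattern follows; the arrays here are synthetic stand-ins for the notebook's `target_sample_test` and `pred_KNN`, `pred_DT`, etc., so the sketch is self-contained:

```python
# Sketch: compute accuracy, precision, recall, and F1 for several models in one loop.
# y_true and the preds dict are synthetic stand-ins for the notebook's
# target_sample_test and pred_KNN / pred_DT / pred_NB / pred_BG / pred_RF.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)           # stand-in for target_sample_test
preds = {name: rng.integers(0, 2, size=100)     # stand-ins for the pred_* arrays
         for name in ['KNN', 'DT', 'NB', 'BG', 'RF']}

scorers = {'accuracy': accuracy_score, 'precision': precision_score,
           'recall': recall_score, 'f1_score': f1_score}
metrics = pd.DataFrame(index=preds.keys(), columns=scorers.keys())
for name, y_pred in preds.items():
    for col, fn in scorers.items():
        metrics.loc[name, col] = fn(y_true, y_pred)

# Express everything as percentages, as in the report's table.
metrics = (metrics.astype(float) * 100).round(2)
print(metrics)
```

With the notebook's real predictions substituted in, this produces the same table as the explicit assignments above while keeping the metric list in one place.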
fig, ax = mpt.subplots(figsize = (12,10))
metrics.plot(kind='barh', ax=ax)
mpt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(True)
The plot shows the metrics for the positive class of the target variable (1), i.e. the customer is a defaulter.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, GridSearchCV
cv_method_ttest = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=999)
cv_results_KNN = cross_val_score(estimator=gs_KNN.best_estimator_,
X=Data_sample_test,
y=target_sample_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_KNN.mean()
0.582182674097511
cv_results_DT = cross_val_score(estimator=gs_DT.best_estimator_,
X=Data_sample_test,
y=target_sample_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_DT.mean()
0.5975782784885272
Data_sample_test_transformed = PowerTransformer().fit_transform(Data_sample_test)
cv_results_NB = cross_val_score(estimator=gs_NB.best_estimator_,
X=Data_sample_test_transformed,
y=target_sample_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_NB.mean()
0.5733004002491597
cv_results_BG = cross_val_score(estimator=gs_BG.best_estimator_,
X=Data_sample_test,
y=target_sample_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_BG.mean()
0.579335821129004
cv_results_RF = cross_val_score(estimator=gs_RF.best_estimator_,
X=Data_sample_test,
y=target_sample_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_RF.mean()
0.5849732765886713
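The five nearly identical `cross_val_score` calls above can also be driven by a loop over the estimators. The sketch below shows the pattern on a stand-in imbalanced dataset with untuned estimators; in the notebook, `Data_sample_test` / `target_sample_test` and the tuned `gs_*.best_estimator_` objects would be substituted:

```python
# Sketch: run the same RepeatedStratifiedKFold ROC-AUC evaluation for several
# models in one loop. The dataset and estimators are stand-ins for the
# notebook's sample data and tuned best estimators.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic data with roughly the report's 78/22 class split.
X, y = make_classification(n_samples=500, weights=[0.78], random_state=999)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=999)

estimators = {'KNN': KNeighborsClassifier(),
              'DT': DecisionTreeClassifier(max_depth=3, random_state=999),
              'NB': GaussianNB()}
cv_results = {name: cross_val_score(est, X, y, cv=cv,
                                    scoring='roc_auc', n_jobs=-2)
              for name, est in estimators.items()}
for name, scores in cv_results.items():
    print(f'{name}: mean ROC-AUC = {scores.mean():.4f}')
```

Because every estimator is scored on the identical `RepeatedStratifiedKFold` splits, the resulting score arrays are directly comparable fold-by-fold, which is exactly the condition the paired t-tests below rely on.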
from scipy import stats
print(stats.ttest_rel(cv_results_KNN, cv_results_NB))
print(stats.ttest_rel(cv_results_KNN, cv_results_DT))
print(stats.ttest_rel(cv_results_KNN,cv_results_BG))
print(stats.ttest_rel(cv_results_KNN, cv_results_RF))
print(stats.ttest_rel(cv_results_DT, cv_results_BG))
print(stats.ttest_rel(cv_results_DT, cv_results_NB))
print(stats.ttest_rel(cv_results_DT, cv_results_RF))
print(stats.ttest_rel(cv_results_NB, cv_results_RF))
print(stats.ttest_rel(cv_results_NB, cv_results_BG))
print(stats.ttest_rel(cv_results_BG, cv_results_RF))
Ttest_relResult(statistic=4.382630600966168, pvalue=6.182279102834216e-05)
Ttest_relResult(statistic=-9.266989010797381, pvalue=2.3672292072983096e-12)
Ttest_relResult(statistic=1.4633217673671863, pvalue=0.14976498604645117)
Ttest_relResult(statistic=-1.5363112970669919, pvalue=0.1308947466697605)
Ttest_relResult(statistic=9.755204053843135, pvalue=4.553509059309014e-13)
Ttest_relResult(statistic=15.334889776896857, pvalue=2.463338061036852e-20)
Ttest_relResult(statistic=7.2442621609815525, pvalue=2.7780756945755856e-09)
Ttest_relResult(statistic=-5.3878854907125975, pvalue=2.020436947932662e-06)
Ttest_relResult(statistic=-2.51776788527314, pvalue=0.015128341459683493)
Ttest_relResult(statistic=-6.666726096097443, pvalue=2.1787123525842363e-08)
A p-value below 0.05 indicates a statistically significant difference. From the results above, only the KNN vs BG (p = 0.1498) and KNN vs RF (p = 0.1309) comparisons have p-values greater than 0.05, indicating that at the 5% significance level the differences between those scores, though present, are not statistically significant and their performance is comparable. All other model pairs have p-values below 0.05, so their differences are statistically significant.
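The ten pairwise tests above are easier to read when generated in a loop with an explicit significance flag. A minimal sketch, using synthetic score arrays in place of the notebook's `cv_results_*` (which, importantly, must all come from the same CV splits for the *paired* test to be valid):

```python
# Sketch: tabulate pairwise paired t-tests with a significance flag at alpha = 0.05.
# The cv_results dict holds synthetic stand-ins for cv_results_KNN, cv_results_DT, etc.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(42)
cv_results = {name: 0.58 + 0.02 * rng.standard_normal(50)   # 50 = 10 folds x 5 repeats
              for name in ['KNN', 'DT', 'NB', 'BG', 'RF']}

for a, b in combinations(cv_results, 2):
    t, p = stats.ttest_rel(cv_results[a], cv_results[b])
    flag = 'significant' if p < 0.05 else 'not significant'
    print(f'{a} vs {b}: t = {t:+.3f}, p = {p:.4f} ({flag})')
```

This emits one labelled line per model pair (ten lines for five models), which avoids reading the raw `Ttest_relResult` tuples by position.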
We observe that the target variable is imbalanced, with about 78% of the observations belonging to the negative class.
Target.value_counts(normalize=True)
0    0.782845
1    0.217155
Name: loan_default, dtype: float64
print('Positive class is only 21%. There surely is a class imbalance problem.')
print('In this case, F1-Score or AUC would be preferred.')
Positive class is only 21%. There surely is a class imbalance problem. In this case, F1-Score or AUC would be preferred.
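Beyond preferring F1 or AUC for evaluation, a common mitigation for this kind of imbalance (not used in this notebook) is class weighting during training. A hedged sketch on a synthetic dataset with roughly the same 78/22 split, using scikit-learn's `class_weight='balanced'` option:

```python
# Sketch: effect of class weighting on an imbalanced problem similar to this one.
# The data is synthetic; the notebook itself does not apply class weighting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2000, weights=[0.78], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                  random_state=0).fit(X_tr, y_tr)

# 'balanced' reweights samples inversely to class frequency, which typically
# trades some precision for recall on the minority (defaulter) class.
print('F1 (unweighted):', f1_score(y_te, plain.predict(X_te)))
print('F1 (balanced)  :', f1_score(y_te, weighted.predict(X_te)))
```

Whether weighting (or resampling) actually helps here would need to be verified on the real loan data; the point is only that the imbalance can be addressed at training time, not just at evaluation time.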
The aim of this project was to train and test five supervised machine learning models to predict whether a borrower will default on the first EMI.
The models considered were K-Nearest Neighbors, Naive Bayes, Decision Tree, Bagging and Random Forest.
In Phase 1 of the project we carried out data preprocessing and visualization, making the dataset ready for modeling and prediction in Phase 2. Unnecessary attributes were removed from the dataset, the necessary preprocessing and transformations were applied to the remaining attributes, and the relationships between features were explored through visualization.
Phase 2 of this project focused on building supervised machine learning models for predicting loan default by a borrower. The five classification models listed above were designed, tuned and tested.
Performance comparison by ROC-AUC and F1 score was performed to assess the best model.
metrics1 = pd.DataFrame(index=['KNN', 'DT', 'NB', 'BG', 'RF'],
                        columns=['ROC-AUC', 'f1_score'])
metrics1.loc['KNN' ,'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_KNN)
metrics1.loc['DT' , 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_DT)
metrics1.loc['NB' , 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_NB)
metrics1.loc['BG', 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_BG)
metrics1.loc['RF', 'f1_score']= f1_score(y_true = target_sample_test, y_pred = pred_RF)
metrics1.loc['KNN', 'ROC-AUC']= roc_auc_score(target_sample_test, knn_probs)
metrics1.loc['DT', 'ROC-AUC']= roc_auc_score(target_sample_test, dt_probs)
metrics1.loc['NB', 'ROC-AUC']= roc_auc_score(target_sample_test, nb_probs)
metrics1.loc['BG', 'ROC-AUC']= roc_auc_score(target_sample_test, bg_probs)
metrics1.loc['RF', 'ROC-AUC']= roc_auc_score(target_sample_test, rf_probs)
metrics1 = metrics1*100
metrics1
| ROC-AUC | f1_score | |
|---|---|---|
| KNN | 59.026131 | 0.975461 |
| DT | 60.700925 | 0.36872 |
| NB | 57.19795 | 24.771449 |
| BG | 58.006897 | 8.88712 |
| RF | 58.690488 | 6.23355 |
fig, ax = mpt.subplots(figsize = (12,10))
metrics1.plot(kind='barh', ax=ax)
mpt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(True)
From all the tests we can conclude that the Decision Tree model with 15 features has the highest ROC-AUC on the test dataset; it also achieves the highest ROC-AUC on the training dataset, outperforming all the other models. We also observe that the F1 scores of all the models are similar for target class 0 (non-defaulters), while for the defaulters (class 1) the NB model has the highest F1 score. Further model deployment can be done using the Decision Tree model. Only 15 features are considered here; other feature selection methods could be explored to reach stronger conclusions.
From the modeling and fitting we conclude that the Decision Tree is the best model for this dataset among the models considered in the project. We performed feature selection and used the 15 selected features to model and fit the dataset. The Decision Tree algorithm can be used for further deployment of the model, which is beyond the scope of this project and can be done in the future. Hence we conclude that, using the Decision Tree algorithm, we can predict whether a borrower will default on a vehicle loan at the first EMI (equated monthly installment) due date.
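As a natural follow-up for deployment, the fitted decision tree's `feature_importances_` attribute shows which of the selected features drive its predictions. A sketch on synthetic data with hypothetical feature names (the notebook's 15 selected features would be substituted):

```python
# Sketch: rank the 15 selected features of a fitted decision tree by importance.
# The data and feature names are synthetic stand-ins for the notebook's selection.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           random_state=0)
names = [f'feature_{i}' for i in range(15)]   # hypothetical feature names

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
# feature_importances_ is the (normalized) total impurity decrease per feature.
importances = (pd.Series(tree.feature_importances_, index=names)
                 .sort_values(ascending=False))
print(importances.head())
```

Inspecting these importances on the real loan data would both sanity-check the 15-feature selection and give the business a ranked view of which borrower attributes matter most for first-EMI default.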